Sentence Directed Video Object Codetection 

Haonan Yu, Student Member, IEEE and Jeffrey Mark Siskind, Senior Member, IEEE 


Abstract —We tackle the problem of video object codetection by leveraging the weak semantic constraint implied by sentences that 
describe the video content. Unlike most existing work that focuses on codetecting large objects which are usually salient both in size 
and appearance, we can codetect objects that are small or medium sized. Our method assumes no human pose or depth information 
such as is required by the most recent state-of-the-art method. We employ weak semantic constraint on the codetection process by 
pairing the video with sentences. Although the semantic information is usually simple and weak, it can greatly boost the performance of 
our codetection framework by reducing the search space of the hypothesized object detections. Our experiment demonstrates an 
average IoU score of 0.423 on a new challenging dataset which contains 15 object classes and 150 videos with 12,509 frames in total, 
and an average IoU score of 0.373 on a subset of an existing dataset, originally intended for activity recognition, which contains 5 
object classes and 75 videos with 8,854 frames in total. 

Index Terms —video, object codetection, sentences 
1 Introduction 

In this paper, we address the problem of codetecting
objects with bounding boxes from a set of videos, without
any pretrained object detectors. The codetection problem is
typically approached by selecting one out of many object
proposals per image or frame that maximizes a combination
of the confidence scores associated with the selected proposals
and the similarity scores between proposal pairs. While
much prior work focuses on codetecting objects in still images
(e.g., [7, 25, 39, 42]), little prior work [22, 34, 35, 40, 41]
attempts to codetect objects in video. In both lines of work,
most [7, 22, 25, 34, 39, 42] assume that the objects to be
codetected are salient, both in size and appearance, and
located in the center of the field of view. Thus they easily
"pop out." As a result, prior methods succeed with a small
number of object proposals in each image or frame. Tang
et al. [42] and Joulin et al. [22] used approximately 10 to 20
proposals per image, while Lee and Grauman [25] used 50
proposals per image. Limiting codetection to objects in the
center of the field of view allowed Prest et al. [34] to prune
the search space by penalizing proposals in contact with
the image perimeter. Moreover, under these constraints,
the confidence score associated with proposals is a reliable
measure of salience and a good indicator of which image
regions constitute potential objects [33]. In prior work, the
proposal confidence dominates the overall scoring process
and the similarity measure only serves to refine the confidence.
In contrast, Srikantha and Gall [41] attempt to
codetect small to medium sized objects in video, without
the above simplifying assumptions. However, in order to
search through the larger resulting object proposal space,
they avail themselves of human pose and depth information
to prune the search space. It should also be noted that all
these codetection methods, whether for images or video,
codetect only one common object at a time: different object
classes are codetected independently.

The confidence score of a proposal can be a poor indicator
of whether a proposal denotes a salient object, especially
when objects are occluded, the lighting is poor, or motion
blur exists (e.g., see Figure 1). Salient objects can have low
confidence scores while nonsalient objects or image regions
that do not correspond to objects can have high confidence
scores. Thus our scoring function does not use the confidence
scores produced by the proposal generation mechanism.
Moreover, our method does not rely on human pose and
depth information, which is not always available. Human
pose can be difficult to estimate reliably when a person is
only partially visible or is self-occluded [3], as is the case
with most of our videos.

We avail ourselves of a different source of constraint on 
the codetection problem. In videos depicting human interaction
with objects to be codetected, descriptions of such
activity can impart weak spatial or motion constraint either 
on a single object or among multiple objects of interest. For 
example, if the video depicts a "pick up" event, some object 
should have an upward displacement during this process, 
which should be detectable even if it is small. This motion 
constraint will reliably differentiate the object which is being 
picked up from other stationary background objects. It is 
weak because it might not totally resolve the ambiguity; 
other image regions might satisfy this constraint, perhaps 
due to noise. Similarly, if we know object A is on the left 
of object B, then the detection search for object A will 
weakly affect the detection search for object B, and vice 
versa. To this end, we extract spatio-temporal constraints 
from sentences that describe the videos and then impose 
these constraints on the codetection process to find the most 
salient collections of objects that satisfy these constraints. 
Even though the constraints implied by a single sentence 
are usually weak, when accumulated across a set of videos 
and sentences, they together will greatly prune the detection 
search space. We call this process sentence directed video 
object codetection. It can be viewed as the inverse of video 
captioning/description 13 [l4l [TTl where object evidence 
(detections or other visual features) is first produced by 
pretrained detectors and then sentences are generated given 
the object appearance and movement. 
Fig. 1. Object proposal confidence scores and saliency scores for a sample frame from our new dataset. Left: the original input video frame. Middle: 
several proposals and associated confidence scores produced by the method of Arbelaez et al. (U. Note that the red boxes, which do not correspond 
to objects, let alone salient ones, all have higher scores than the green box, which does denote a salient object. Right: the saliency map output by 
the saliency detection method of Jiang et al. [21], currently the highest ranking method on the MIT saliency benchmark [10]. Note that the cooler is 
not highlighted as salient. Using these scores as part of the scoring function can drive the codetection process to produce undesired results. 


Generally speaking, we extract a set of predicates from
each sentence and formulate each predicate around a
set of primitive functions. The predicates may be verbs
(e.g., CARRIED and ROTATED), spatial-relation prepositions
(e.g., TOTHELEFTOF and ABOVE), motion prepositions (e.g.,
AWAYFROM and TOWARDS), or adverbs (e.g., QUICKLY and
SLOWLY). The sentential predicates are applied to the candidate
object proposals as arguments, allowing an overall
predicate score to be computed that indicates how well these
candidate object proposals satisfy the sentence semantics.
We add this predicate score into the codetection framework,
on top of the original similarity score, to guide the optimization.
To the best of our knowledge, this is the first work that
uses sentences to guide generic video object codetection. To
summarize, our approach differs from the indicated prior
work in the following ways:

(a) Our method can codetect small or medium sized nonsalient
objects which can be located anywhere in the
field of view.

(b) Our method does not require or assume human pose or
depth information.

(c) Our method can codetect multiple objects simultaneously.
These objects can be either moving in the foreground
or stationary in the background.

(d) Our method allows fast object movement and motion
blur. Such is not exhibited in prior work.

(e) Our method leverages sentence semantics to help codetection.

We evaluate our approach on two different datasets. The
first is a new dataset that contains 15 distinct object classes
and 150 video clips with a total of 12,509 frames. The second
is a subset of CAD-120 [24], a dataset originally intended
for activity recognition, that contains 5 distinct object classes
and 75 video clips with a total of 8,854 frames. Our approach
achieves an average IoU (Intersection-over-Union) score of
0.423 on the former and 0.373 on the latter. It yields an
average detection accuracy of 0.7 to 0.8 on the former (when
the IoU threshold is 0.4 to 0.3) and 0.5 to 0.6 on the latter
(when the IoU threshold is 0.4 to 0.3).
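
For readers unfamiliar with the two measures quoted here, the following is a minimal sketch of how IoU and thresholded detection accuracy are commonly computed. The box layout `(x_min, y_min, x_max, y_max)`, the function names, and the use of `>=` at the threshold are assumptions for illustration, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_accuracy(ious, threshold):
    """Fraction of detections whose IoU with ground truth meets the threshold.

    Counting IoU >= threshold as correct is an assumed convention.
    """
    return sum(v >= threshold for v in ious) / len(ious)
```

Lowering the threshold from 0.4 to 0.3 can only keep or increase the accuracy, which matches the direction of the ranges reported above.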

2 Related Work 

Corecognition is a simpler variant of codetection [44], where
the objects of interest are sufficiently prominent in the field
of view that the problem does not require object localization.
Thus corecognition operates like unsupervised clustering,
using feature extraction and the similarity measure. Codetection
[7, 25, 42] additionally requires localization, often by
putting bounding boxes around the objects. This can require
combinatorial search over a large space of possible object
locations. One way to remedy this is to limit the space of
possible object locations to those produced by an object
proposal method lH [H [TTJ 133 . These methods typically 
associate a confidence score with each proposal which can
be used to prune or prioritize the search. Codetection is typically
formulated as the process of selecting one proposal per
image or frame, out of the many produced by the proposal
mechanism, that maximizes the collective confidence of and
similarity between the selected proposals. This optimization
is usually performed with Belief Propagation [32] or with
nonlinear programming. Recently, the codetection problem
has been extended to video [22, 34, 35, 40, 41]. Like Srikantha
and Gall [41], we codetect small and medium objects, but do
so without using human pose or a depth map. Like Schulter
et al. [40], we codetect both moving and stationary objects,
but do so with a larger set of object classes and a larger video
corpus. Also, like Ramanathan et al. [35], we use sentences
to guide video codetection, but do so for a vocabulary that
goes beyond pronouns, nominals, and names that are used
to codetect only human face tracks.

Another line of work learns visual structures or models 
from image captions (61 ITsHSOl l28l l30l , treating the input 
as a parallel image-text dataset. Since this work focuses on 
images and not video, the sentential captions only contain 
static concepts, such as the names of people or the spatial 
relations between objects in the images. In contrast, our 
approach models the motion and changing spatial relations 
that are present only in video as described by verbs and 
motion prepositions in the sentential annotation. 

3 Sentence Directed Codetection 

Our sentence-directed codetection approach is illustrated in
Figure 2. The input is a set of videos paired with human-elicited
sentences, one sentence per video. A collection of
object-candidate generators and video-tracking methods is
applied to each video to obtain a pool of object proposals.¹
Object instances and predicates are extracted from the
paired sentence. Given multiple such video-sentence pairs, a
graph is formed where object instances serve as vertices and
similarities between object instances and predicates linking
object instances in a sentence serve as edges. Finally, Belief
Propagation is applied to this graph to jointly infer object
codetections.

1. For clarity, in the remainder of this paper, we refer to object
proposals for a single frame as object candidates, while we refer to
object tubes or tracks across a video as object proposals.

[Figure 2. Input sentences: "The man removed the violet cabbage from the bowl."; "The person carried the squash to the left, away from the yellow bowl."; "The person is placing the mouthwash next to the cabbage in the sink." Scores shown: SIMILAR(bowl0, bowl1), SIMILAR(cabbage0, cabbage1), LEFTWARDS(squash0), AWAYFROM(squash0, bowl1), DOWN(mouthwash0), NEAR(mouthwash0, cabbage1). Right panel: Output.]

Fig. 2. An overview of our codetection process. Left: input a set of videos paired with sentences. Middle: sentence directed codetection,
where black bounding boxes represent object proposals. Right: output original videos with objects codetected. Note that no pretrained object
detectors are used in this whole process. Also note how sentence semantics plays an important role in this process: it provides both unary
scores, e.g., LEFTWARDS(squash0) and DOWN(mouthwash0), for proposal confidence, and binary scores, e.g., OUTFROM(cabbage0, bowl0) and
NEAR(mouthwash0, cabbage1), for relating multiple objects in the same video. (Best viewed in color.)

3.1 Sentence Semantics 

Our main contribution is exploiting sentence semantics
to help the codetection process. We use a conjunction of
predicates to represent (a portion of) the semantics of a
sentence. Object instances in a sentence fill the arguments
of the predicates in that sentence. An object instance that
fills the arguments of multiple predicates is said to be
coreferenced. For a coreferenced object instance, only one
track is codetected. For example, a sentence like "the person
put the mouthwash into the sink near the cabbage" implies the
following conjunction of predicates:

DOWN(mouthwash) ∧ NEAR(mouthwash, cabbage)

In this case, mouthwash is coreferenced by the predicates
DOWN (fills the sole argument) and NEAR (fills the first
argument). Thus only one mouthwash track will be produced,
simultaneously constrained by the two predicates (Figure 2,
blue track).

In principle, one could map sentences to conjunctions of 
our predicates using standard semantic parsing techniques 
iniiii. However, modern semantic parsers are domain 
specific, and employ machine-learning methods to train a
semantic parser for a specific domain. No existing semantic
parser has been trained on our domain. Training a new semantic
parser requires a parallel corpus of sentences paired
with intended semantic representations. Modern semantic
parsers are trained with corpora like PropBank [31] that
have tens of thousands of manually annotated sentences.
Gathering such a large training corpus would be overkill for
our experiments that involve only a few hundred sentences,
especially since such is not our focus or contribution. Thus
like Lin et al. [26], Kong et al. [23], and Plummer et al. [33],
we employ simpler handwritten rules to fully automate the
semantic parsing process for our limited corpus. Nothing,
in principle, precludes using a machine-trained semantic
parser in its place.

Our semantic parser employs seven steps.

1) Spelling errors are corrected with Ispell.

2) The NLTK parser² is used to obtain the POS tags for
each word in the sentence.

3) POS tagging errors are corrected by a postprocessing
step with a small set of rules (Table 1a).

4) Words with a specified set of POS tags³ are eliminated.

5) NLTK is used to lemmatize all nouns and verbs.

6) Synonyms are conflated by mapping phrases to a
smaller set of nouns and verbs using a small set of rules
(Table 1b).

7) A small set of rules map the resulting word strings to
predicates (Table 1c).

The entire process is fully automatic and implemented in
less than two pages of Python code.
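
As an illustration of steps 4, 6, and 7, the following is a minimal sketch that starts from already tagged and lemmatized words (steps 1, 2, 3, and 5 are assumed to have run). It is not the authors' code: the table contents echo the sample rules and the dropped-tag list given in this section, but the data structures, the function names, the single-word synonym map, and the choice of which synonym is canonical are all assumptions.

```python
# Step 4: POS tags whose words are eliminated (list taken from footnote 3).
DROPPED_TAGS = {"PRP$", "RB", ",", ".", "JJ", "CC", "CD", "DT", "JJR"}

# Step 6: hypothetical single-word synonym map; which member of each
# synonym set is canonical is an assumption made for this sketch.
SYNONYMS = {"man": "person", "boy": "person", "dish": "bowl",
            "plate": "bowl", "remove": "take", "toward": "towards"}

# Step 7: templates over canonical words; x and y are argument slots.
RULES = [
    ("person carry x to left near y",
     ["MOVEHORIZONTAL({x})", "NEAREND({x}, {y})"]),
    ("person take x out of y",
     ["INSTART({x}, {y})", "AWAYFROM({x}, {y})"]),
    ("person put x on table", ["MOVEDOWN({x})"]),
]


def match_template(template, words):
    """Return slot bindings if words match the template exactly, else None."""
    slots = template.split()
    if len(slots) != len(words):
        return None
    binding = {}
    for slot, word in zip(slots, words):
        if slot in ("x", "y"):
            # A slot must bind to the same word everywhere it occurs.
            if binding.setdefault(slot, word) != word:
                return None
        elif slot != word:
            return None
    return binding


def parse(tagged_lemmas):
    """Map a list of (lemma, POS tag) pairs to a list of predicate strings."""
    kept = [word for word, tag in tagged_lemmas if tag not in DROPPED_TAGS]
    words = [SYNONYMS.get(word, word) for word in kept]
    for template, predicates in RULES:
        binding = match_template(template, words)
        if binding is not None:
            return [p.format(**binding) for p in predicates]
    return []  # no rule applies


sentence = [("the", "DT"), ("man", "NN"), ("take", "VB"), ("the", "DT"),
            ("violet", "JJ"), ("cabbage", "NN"), ("out", "IN"), ("of", "IN"),
            ("the", "DT"), ("bowl", "NN"), (".", ".")]
predicates = parse(sentence)
```

Exact template matching keeps the sketch short; a fuller version would also conflate multi-word phrases such as "ice chest" before matching.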

The rules employed by the last step of the above process
generate a weak semantic representation, containing
only those predicates that are relevant to our codetection
process. For example, for the phrase "into the sink"
in the above sentence, it is beyond our interest to detect
the object sink. Thus our predefined rules generate
DOWN(mouthwash) instead of INTO(mouthwash, sink). Also,

2. http://www.nltk.org/

3. PRP$/possessive-pronoun, RB/adverb, ,/comma, ./period,
JJ/adjective, CC/coordinating-conjunction, CD/cardinal-number,
DT/determiner, and JJR/adjective-comparative
TABLE 1

Sample rules from (a) step 3, (b) step 6, and (c) step 7 of our semantic parser. (top) For our new dataset. (bottom) For the subset of CAD-120.

(a) Corrected POS tags
    (top)     towards/IN, ice chest/NN, watering can/NN, vegetable/NN,
              watering pot/NN, gas can/NN, poured/VBD, pineapple/NN,
              box/NN, blue/JJ, violet/JJ, pours/VBZ, underneath/IN,
              off/IN, inside/IN, place/VB
    (bottom)  up/RB, down/RB, blue/JJ, drank/VBD, pours/VBZ, poured/VBD,
              places/VBZ, reaches/VBZ, bowl/NN

(b) Synonym sets
    (top)     {gas can, gasoline can, gasoline tank}
              {put, set, place}
              {pick up, lift}
              {milk, almond milk, carton}
              {cooler, ice chest}
              {table, tennis table, ping pong table}
              {ground, driveway, floor}
              {bowl, dish, plate}
              {bucket, pail}
              {cabbage, vegetable}
              {box, cardboard box}
              {person, man, boy}
              {leftwards, leftward}
              {out, outside}
              {into, inside of, inside}
              {towards, toward}
              {on, onto}
    (bottom)  {put, set, stack, place}
              {pick up, raise, lift}
              {drink, take drink, pick up drink}
              {take, remove}
              {cup, container, mug, tin}
              {cereal, cereal box, box}
              {water, liquid}
              {ground, floor}
              {on, onto}

(c) Word strings and the predicates they map to
    (top)     person carry x rightwards put it near y
              person carry x leftwards put it near y
              person carry x to right near y
              person carry x to left near y
              person put x on right near y
              person move x from left to right near y
              person carry x to left put it near y
                  → MOVEHORIZONTAL(x) ∧ NEAREND(x, y)

              person move x out of y
              person take x out of y put it to right
              person take x out of y
              person take x from y put it on counter
                  → INSTART(x, y) ∧ AWAYFROM(x, y)

    (bottom)  person put x into table
              person put x on table
              person put down x
              person put x down on table
              person put x down
                  → MOVEDOWN(x)

              person pour x into y
              person pick up x pour it into y
              person pick up pour x into y
                  → ROTATE(x) ∧ OVER(x, y)


although a more detailed semantic representation for this
sentence would include PUT(person, mouthwash), we simplify
this two-argument predicate to a one-argument predicate
MOVE(mouthwash), since we do not attempt to codetect
people. To ensure that we do not introduce surplus
semantics, the generated predicates always imply a weaker
constraint than the original sentence.

Each predicate is formulated around a set of primitive
functions on the arguments of the predicate. The
primitive functions produce scores indicating how well
the arguments satisfy the constraint. The aggregate score
over the functions constitutes the predicate score. Table 2
shows the complete list of our 24 predicates and the scores
they compute. The function $\mathrm{medFlMg}(p)$ computes the
median of the average optical flow magnitude within the
detections for the proposal $p$. The functions $x(p^{(t)})$ and
$y(p^{(t)})$ return the $x$- and $y$-coordinates of the center of $p^{(t)}$,
normalized by the frame width and height, respectively.
The function $\mathrm{distLessThan}(x, a)$ is defined as
$\log[1/(1 + \exp(-b(x - a)))]$, where we set $b = -20$ in the
experiment. Similarly, the function $\mathrm{distGreaterThan}(x, a)$ is
defined as $\mathrm{distLessThan}(-x, -a)$. The function
$\mathrm{dist}(p_1^{(t)}, p_2^{(t)})$ computes the distance between the centers
of $p_1^{(t)}$ and $p_2^{(t)}$, also normalized by the frame size. The function
$\mathrm{smaller}(p_1^{(t)}, p_2^{(t)})$ returns 0 if the size of $p_1^{(t)}$ is smaller than
that of $p_2^{(t)}$, and $-\infty$ otherwise. The function $\mathrm{tempCoher}(p)$
evaluates whether the position of proposal $p$ changes during
the video, by checking the position offsets between every
two frames. A higher tempCoher score indicates that $p$
is more likely to be stationary in the video. The function
$\mathrm{rotAngle}(p^{(t)})$ computes the current rotated angle of the
object inside $p^{(t)}$ by looking back 1 second (30 frames).
We extract SIFT features [27] for both $p^{(t)}$ and $p^{(t-30)}$ and
match them to estimate the similarity transformation matrix,
from which the angle can be computed. Finally, the function
$\mathrm{hasRotation}(\alpha, \beta)$ computes the rotation log-likelihood
given angle $\alpha$ through the von Mises distribution for which
we set the location $\mu = \beta$ and the concentration $\kappa = 4$.
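
The soft threshold functions and the von Mises log-likelihood described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the numerically stable rewriting of the logarithm, the series evaluation of the Bessel function, and the inclusion of the von Mises normalizing constant are all choices made here.

```python
import math

B = -20.0  # slope b used in the experiment, as stated in the text


def dist_less_than(x, a, b=B):
    """log[1 / (1 + exp(-b (x - a)))]: near 0 when x < a, very negative when x > a."""
    z = -b * (x - a)
    # Stable evaluation of -log(1 + exp(z)) for either sign of z.
    if z > 0:
        return -(z + math.log1p(math.exp(-z)))
    return -math.log1p(math.exp(z))


def dist_greater_than(x, a, b=B):
    """Defined in the text as distLessThan(-x, -a)."""
    return dist_less_than(-x, -a, b)


def bessel_i0(kappa, terms=40):
    """Modified Bessel function of order 0 via its power series."""
    total, term = 1.0, 1.0
    for m in range(1, terms):
        term *= (kappa / 2.0) ** 2 / (m * m)
        total += term
    return total


def has_rotation(alpha, beta, kappa=4.0):
    """Von Mises log-density of angle alpha with location beta, concentration kappa."""
    return kappa * math.cos(alpha - beta) - math.log(2.0 * math.pi * bessel_i0(kappa))


def move_up(y_first, y_last, med_flow, delta_large=0.25):
    """Sketch of MOVEUP: motion score plus a soft test that y decreased by delta_large.

    Image y is assumed to grow downwards, so moving up lowers y.
    """
    return med_flow + dist_less_than(y_last - y_first, -delta_large)
```

Because every primitive returns a log-score that is at most zero when its constraint is satisfied and strongly negative otherwise, summing primitives behaves like a soft conjunction of the individual tests.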

3.2 Generating Object Proposals

We first generate $N$ object candidates for each video frame.
We use EdgeBoxes |4^ to obtain the $N/2$ top-ranking object
candidates and MCG IH to obtain the other half, filtering out
candidates larger than a fixed fraction of the video-frame size
to focus on small and medium-sized objects. This yields $NT$ object
candidates for a video with $T$ frames. We then generate $K$
object proposals from these $NT$ candidates. To obtain object
proposals with object candidates of consistent appearance
and spatial location, one would nominally require that
$K \ll N^T$. To circumvent this, we first randomly sample a
frame $t$ from the video with probability proportional to the
averaged magnitude of optical flow |l3 within that frame.
Then, we sample an object candidate from the $N$ candidates
in frame $t$. To decide whether the object is moving or not,
we sample from {MOVING, STATIONARY} according to a fixed
distribution. We sample a MOVING object candidate with
probability proportional to the average flow magnitude within
the candidate. Similarly, we sample a STATIONARY object
candidate with probability inversely proportional to the
average flow magnitude within the candidate. The sampled
candidate is then propagated (tracked) bidirectionally to the
start and the end of the video. We use the CamShift algorithm
m to track both MOVING and STATIONARY objects,
allowing the size of MOVING objects to change during the
process, but requiring the size of STATIONARY objects to
remain constant. STATIONARY objects are tracked to account
for noise or occlusion that manifests as small motion or
change in size. We track MOVING objects in HSV color
space and STATIONARY objects in RGB color space. We do
not use optical-flow-based tracking methods since these
methods suffer from drift when objects move quickly. We
repeat this sampling and propagation process $K$ times to
obtain $K$ object proposals $\{p_k\}$ for each video. Examples of
the sampled proposals ($K = 240$) are shown in the middle
column of Figure 2.

TABLE 2

Our predicates and their semantics. For simplicity, we show the
computation on only a single first frame or last frame of a
proposal. In practice, to reduce noise, all of the scores are averaged
over the first or last $L$ frames.

$$\Delta_{\mathrm{distLarge}} = 0.25 \qquad \Delta_{\mathrm{distSmall}} = 0.05 \qquad \Delta_{\mathrm{angle}} = \pi/2$$

$$\begin{aligned}
\mathrm{MOVE}(p) &= \mathrm{medFlMg}(p) \\
\mathrm{MOVEUP}(p) &= \mathrm{MOVE}(p) + \mathrm{distLessThan}\big(y(p^{(T)}) - y(p^{(1)}),\, -\Delta_{\mathrm{distLarge}}\big) \\
\mathrm{MOVEDOWN}(p) &= \mathrm{MOVE}(p) + \mathrm{distGreaterThan}\big(y(p^{(T)}) - y(p^{(1)}),\, \Delta_{\mathrm{distLarge}}\big) \\
\mathrm{MOVEVERTICAL}(p) &= \mathrm{MOVE}(p) + \mathrm{distGreaterThan}\big(|y(p^{(T)}) - y(p^{(1)})|,\, \Delta_{\mathrm{distLarge}}\big) \\
\mathrm{MOVELEFTWARDS}(p) &= \mathrm{MOVE}(p) + \mathrm{distLessThan}\big(x(p^{(T)}) - x(p^{(1)}),\, -\Delta_{\mathrm{distLarge}}\big) \\
\mathrm{MOVERIGHTWARDS}(p) &= \mathrm{MOVE}(p) + \mathrm{distGreaterThan}\big(x(p^{(T)}) - x(p^{(1)}),\, \Delta_{\mathrm{distLarge}}\big) \\
\mathrm{MOVEHORIZONTAL}(p) &= \mathrm{MOVE}(p) + \mathrm{distGreaterThan}\big(|x(p^{(T)}) - x(p^{(1)})|,\, \Delta_{\mathrm{distLarge}}\big) \\
\mathrm{ROTATE}(p) &= \mathrm{MOVE}(p) + \max_t \mathrm{hasRotation}\big(\mathrm{rotAngle}(p^{(t)}),\, \Delta_{\mathrm{angle}}\big) \\
\mathrm{TOWARDS}(p_1, p_2) &= \mathrm{MOVE}(p_1) + \mathrm{distLessThan}\big(\mathrm{dist}(p_1^{(T)}, p_2^{(T)}) - \mathrm{dist}(p_1^{(1)}, p_2^{(1)}),\, -\Delta_{\mathrm{distLarge}}\big) \\
\mathrm{AWAYFROM}(p_1, p_2) &= \mathrm{MOVE}(p_1) + \mathrm{distGreaterThan}\big(\mathrm{dist}(p_1^{(T)}, p_2^{(T)}) - \mathrm{dist}(p_1^{(1)}, p_2^{(1)}),\, \Delta_{\mathrm{distLarge}}\big) \\
\mathrm{LEFTOFSTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distLessThan}\big(x(p_1^{(1)}) - x(p_2^{(1)}),\, -\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{LEFTOFEND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distLessThan}\big(x(p_1^{(T)}) - x(p_2^{(T)}),\, -\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{RIGHTOFSTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distGreaterThan}\big(x(p_1^{(1)}) - x(p_2^{(1)}),\, \Delta_{\mathrm{distSmall}}\big) \\
\mathrm{RIGHTOFEND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distGreaterThan}\big(x(p_1^{(T)}) - x(p_2^{(T)}),\, \Delta_{\mathrm{distSmall}}\big) \\
\mathrm{ONTOPOFSTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distGreaterThan}\big(y(p_1^{(1)}) - y(p_2^{(1)}),\, -2\Delta_{\mathrm{distLarge}}\big) \\
&\quad + \mathrm{distLessThan}\big(y(p_1^{(1)}) - y(p_2^{(1)}),\, 0\big) + \mathrm{distLessThan}\big(|x(p_1^{(1)}) - x(p_2^{(1)})|,\, 2\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{ONTOPOFEND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distGreaterThan}\big(y(p_1^{(T)}) - y(p_2^{(T)}),\, -2\Delta_{\mathrm{distLarge}}\big) \\
&\quad + \mathrm{distLessThan}\big(y(p_1^{(T)}) - y(p_2^{(T)}),\, 0\big) + \mathrm{distLessThan}\big(|x(p_1^{(T)}) - x(p_2^{(T)})|,\, 2\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{NEARSTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distLessThan}\big(\mathrm{dist}(p_1^{(1)}, p_2^{(1)}),\, 2\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{NEAREND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distLessThan}\big(\mathrm{dist}(p_1^{(T)}, p_2^{(T)}),\, 2\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{INSTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{NEARSTART}(p_1, p_2) + \mathrm{smaller}(p_1^{(1)}, p_2^{(1)}) \\
\mathrm{INEND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{NEAREND}(p_1, p_2) + \mathrm{smaller}(p_1^{(T)}, p_2^{(T)}) \\
\mathrm{BELOWSTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distGreaterThan}\big(y(p_1^{(1)}) - y(p_2^{(1)}),\, \Delta_{\mathrm{distSmall}}\big) \\
\mathrm{BELOWEND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distGreaterThan}\big(y(p_1^{(T)}) - y(p_2^{(T)}),\, \Delta_{\mathrm{distSmall}}\big) \\
\mathrm{ABOVESTART}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distLessThan}\big(y(p_1^{(1)}) - y(p_2^{(1)}),\, -\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{ABOVEEND}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \mathrm{distLessThan}\big(y(p_1^{(T)}) - y(p_2^{(T)}),\, -\Delta_{\mathrm{distSmall}}\big) \\
\mathrm{OVER}(p_1, p_2) &= \mathrm{tempCoher}(p_2) + \max_t \Big[ \mathrm{distLessThan}\big(y(p_1^{(t)}) - y(p_2^{(t)}),\, -\Delta_{\mathrm{distSmall}}\big) \\
&\quad + \mathrm{distLessThan}\big(|x(p_1^{(t)}) - x(p_2^{(t)})|,\, \Delta_{\mathrm{distLarge}}\big) \Big]
\end{aligned}$$

3.3 Similarity between Object Proposals

We compute the appearance similarity of two object proposals
as follows. We first uniformly sample $M$ detections
$\{b^m\}$ from each proposal along its temporal extent. For each
sampled detection, we extract PHOW IH and HOG fT3l
features to represent its appearance and shape. We also do
so after we rotate this detection by 90°, 180°, and 270°. Then,
we measure the similarity $g$ between a pair of detections
$b_1^m$ and $b_2^m$ with:

$$g(b_1^m, b_2^m) = \max_{i,j \in \{0,1,2,3\}} \Big[ g_{\chi^2}\big(\mathrm{rot}_i(b_1^m), \mathrm{rot}_j(b_2^m)\big) + g_{L2}\big(\mathrm{rot}_i(b_1^m), \mathrm{rot}_j(b_2^m)\big) \Big]$$

where $\mathrm{rot}_i$, $i = 0, 1, 2, 3$, represents rotation by 0°, 90°, 180°,
and 270°, respectively. We use $g_{\chi^2}$ to compute the distance
between the PHOW features and $g_{L2}$ to compute the
Euclidean distance between the HOG features, after which
the distances are linearly scaled to [0, 1] and converted to
log similarity scores. Finally, the similarity between two
proposals $p_1$ and $p_2$ is taken to be:

$$g(p_1, p_2) = \operatorname*{median}_m \; g(b_1^m, b_2^m)$$

3.4 Joint Inference

We extract object instances (see all 15 classes for our new
dataset and all 5 classes for our subset of CAD-120 in Section 4)
from the sentences and model them as vertices in a
graph. Each vertex $v$ can be assigned one of the $K$ proposals
in the video that is paired with the sentence in which the
vertex occurs. The score of assigning a proposal $k_v$ to a
vertex $v$ is taken to be the unary predicate score $h_v(k_v)$
computed from the sentence (if such exists, or otherwise 0).
We construct an edge between every two vertices $v$ and $u$
that belong to the same object class. We denote this class
membership relation as $(v, u) \in \mathcal{C}$. The score of this edge
$(v, u)$, when the proposal $k_v$ is assigned to vertex $v$ and
the proposal $k_u$ is assigned to vertex $u$, is taken to be the
similarity score $g_{v,u}(k_v, k_u)$ between the two proposals, as
described in Section 3.3. Similarly, we also construct an
edge between two vertices $v$ and $u$ that are arguments of
the same predicate. We denote this predicate membership
relation as $(v, u) \in \mathcal{P}$. The score of this edge $(v, u)$, when
the proposal $k_v$ is assigned to vertex $v$ and the proposal $k_u$
is assigned to vertex $u$, is taken to be the predicate score
$h_{v,u}(k_v, k_u)$ between the two proposals, as described in
Section 3.1. Our problem, then, is to select a proposal for
each vertex that maximizes the joint score on this graph, i.e.,
solving the following optimization problem:

$$\max_{\mathbf{k}} \; \sum_{v} h_v(k_v) + \sum_{(v,u) \in \mathcal{C}} g_{v,u}(k_v, k_u) + \sum_{(v,u) \in \mathcal{P}} h_{v,u}(k_v, k_u)$$

where $\mathbf{k}$ is the collection of the selected proposals for all the
vertices. Note that the unary and binary scores are equally
weighted in the above objective function. This discrete inference
problem on graphical models can be solved approximately
by Belief Propagation [32]. In the experiment, we use
the OpenGM [2] implementation to find the approximate
solution.
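
To make the joint objective of Section 3.4 concrete, the following minimal sketch evaluates the sum of unary predicate scores, class-similarity edge scores, and binary predicate edge scores, and maximizes it by exhaustive search on a toy graph. Exhaustive search is swapped in here only because it is easy to verify; it is not the Belief Propagation solver the paper uses and does not scale beyond a handful of vertices. The data layout (dictionaries of score tables), the function names, and the toy scores are all assumptions.

```python
from itertools import product


def joint_score(assignment, unary, class_edges, pred_edges):
    """Sum of unary, class-similarity, and binary-predicate scores.

    assignment[v] is the index of the proposal chosen for vertex v.
    unary[v][k] is h_v(k); vertices without a unary predicate contribute 0.
    class_edges[(v, u)][kv][ku] is g_{v,u}(kv, ku).
    pred_edges[(v, u)][kv][ku] is h_{v,u}(kv, ku).
    """
    total = sum(scores[assignment[v]] for v, scores in unary.items())
    for (v, u), table in class_edges.items():
        total += table[assignment[v]][assignment[u]]
    for (v, u), table in pred_edges.items():
        total += table[assignment[v]][assignment[u]]
    return total


def best_assignment(num_vertices, num_proposals, unary, class_edges, pred_edges):
    """Exhaustively search all K^|V| assignments (toy sizes only)."""
    candidates = product(range(num_proposals), repeat=num_vertices)
    return max(candidates,
               key=lambda a: joint_score(a, unary, class_edges, pred_edges))


# Toy instance: three vertices, two proposals each (all scores invented).
unary = {0: [-1.0, 0.0]}                               # vertex 0 has a unary predicate
class_edges = {(0, 1): [[0.0, -2.0], [-2.0, -0.5]]}    # vertices 0 and 1 share a class
pred_edges = {(1, 2): [[-3.0, 0.0], [-1.0, -1.0]]}     # vertices 1 and 2 share a predicate

best = best_assignment(3, 2, unary, class_edges, pred_edges)
best_value = joint_score(best, unary, class_edges, pred_edges)
```

In this toy instance the unary score alone would favor proposal 1 for vertex 0, yet the jointly best assignment picks proposal 0, which shows how the edge scores can override a locally preferred choice.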


4 Experiment 

Our method can only be applied to datasets with the following
properties:

I) It can only be applied to video that depicts motion and
changing spatial relations between objects.

II) It can only be applied to video, not images, because it
relies on such motion and changing spatial relations.

III) The video must be paired with sentences that describe
that motion and those changing spatial relations. Some
existing image and video corpora are paired with sentences
that do not describe such.

IV) The objects to be codetected must be detectable by
existing object proposal methods.

V) There must be different clips that all involve different
instances of the same object class participating in the
described activity. This is necessary to support codetection.
Our new dataset

Set  Scene     # videos  # vertices (run 1 / 2 / 3)  Objects
1    k1        26        33 / 34 / 33                box, cabbage, coffee grinder, mouthwash, pineapple, squash
2    k2        27        29 / 37 / 38                bowl, cabbage, coffee grinder, mouthwash, pineapple, squash
3    k2,3      17        24 / 32 / 31                bowl, cabbage, pineapple, squash
4    k4        21        41 / 46 / 36                cup, juice, ketchup, milk
5    B         19        25 / 24 / 24                box, cooler
6    B         17        26 / 22 / 22                box, cooler
7    G         23        32 / 27 / 33                bucket, gas can, watering pot
8    k1,2,3    17        21 / 26 / 27                bowl, cabbage, pineapple, squash
9    B&G       25        35 / 32 / 35                box, bucket, cooler, gas can, watering pot
10   k1,2,3,4  24        39 / 41 / 39                bowl, cabbage, cup, juice, ketchup, milk, mouthwash, pineapple

Our subset of CAD-120

Set  # videos  # vertices (run 1 / 2 / 3)  Objects
1    15        29 / 25 / 25                bowl, cereal, cup, jug, microwave
2    15        27 / 27 / 26                bowl, cereal, cup, jug, microwave
3    15        24 / 24 / 21                bowl, cereal, cup, jug, microwave
4    15        26 / 22 / 23                bowl, cereal, cup, jug, microwave
5    15        27 / 27 / 26                bowl, cereal, cup, jug, microwave

TABLE 3

The experimental setup of the 10 codetection sets for our new dataset and the 5 codetection sets for our subset of CAD-120.



Fig. 3. Examples of the 15 object classes to be codetected in our new dataset and the 5 object classes to be codetected in our subset of CAD-120.
From left to right: the object classes in our new dataset, bowl, box, bucket, cabbage, coffee grinder, cooler, cup, gas can, juice, ketchup, milk,
mouthwash, pineapple, squash, and watering pot, and the object classes in our subset of CAD-120, bowl, cereal, cup, jug, and microwave.


Most existing datasets do not meet the above criteria and are
thus ill suited to our approach. We evaluate on two specific
datasets that are suited. Most existing methods require
properties that our datasets lack. For example, Srikantha
and Gall [41] require depth and human pose information.
Others, such as Prest et al. [34], Schulter et al. [40], Joulin
et al. [22], and Ramanathan et al. [35], do not make code
available. Thus one can neither run our method on existing
datasets nor run existing methods on our datasets.


It is not possible to compare our method to existing
image codetection methods or evaluate on existing image
codetection datasets, or any existing image captioning
datasets, because they lack properties and III Further, 
it is not possible to compare our method to existing video 
codetection methods or existing video codetection datasets. 
Schulter et al. [40] and Ramanathan et al. [35] address
different problems with datasets that are highly specific to
those problems and are thus incomparable. The dataset used
by Prest et al. [34] and Joulin et al. [22] lacks properties
and|I^ Srikantha and Gall evaluate on three datasets: 
ETHZ-activity CSl, CAD-120, and MPII-cooking f3l. Two 
of the these, namely ETHZ-activity and MPII-cooking, lack 
properties [in| and [rv[ Srikantha and Gall rely on depth 

and human pose information to overcome the lack of prop¬ 
erty]^ Moreover, the kinds of activity depicted in ETHZ- 
activity and MPII-cooking cannot easily be formulated in 
terms of descriptions of object motion and changing spatial 
relations. We do apply our method to a subset of CAD- 
120. However, because we do not use depth and human 
pose information, we only consider that subset of CAD- 
120 that satisfies property]^ Srikantha and Gall HT] apply 
their method to a different subject, rendering their results 
incomparable with ours. Moreover, we use incompatible 
sources of information: we use sentences but they do not; 
they use depth and human pose but we do not. Thus it is
impossible to perform an apples-to-apples comparison, even
on the common subset.


There exist a large number of video datasets that are not
used for codetection but rather are used for other purposes
like activity recognition and video captioning. Sentential
annotation is available for some of these, like MPII-cooking,
M-VAD [43], and MPII-MD [37]. However, the vast majority of
the clips in M-VAD (48,986 clips annotated with sentences)
and MPII-MD (68,337 clips annotated with sentences) do
not satisfy properties and We searched the sentential 
annotations from each of these two corpora for all instances
of twelve common English verbs that represent the kinds
of verbs that describe motion and changing spatial
relations between objects.



           M-VAD                  MPII-MD
verb       instances  suitable    instances  suitable
add               89      0/10          120      0/10
carry             74      1/10          273      2/10
lift             435      1/10          374      0/10
load              48      0/10           89      0/10
move             332      0/10         1106      0/10
pick             366      1/10          703      1/10
pour              95      0/10          207      1/10
put              294      1/10          921      0/10
rotate            27      0/10           13      0/10
stack             91      0/10           56      0/10
take            1058      0/10         1786      0/10
unload             1      0/10           11      2/10
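
The corpus search summarized in the table above can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the procedure actually used: the inflection table `VERB_FORMS`, the helper `count_verb_instances`, and the sample sentences are all hypothetical, only two of the twelve verbs are shown, and matching is by whole lowercase word.

```python
import re
from collections import Counter

# Hypothetical inflection table (assumption); only two of the twelve verbs shown.
VERB_FORMS = {
    "carry": {"carry", "carries", "carried", "carrying"},
    "pour": {"pour", "pours", "poured", "pouring"},
}


def count_verb_instances(sentences, verb_forms=VERB_FORMS):
    """Count the sentences that contain at least one form of each verb."""
    counts = Counter()
    for sentence in sentences:
        # Split into lowercase alphabetic words so that matching is whole-word.
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for verb, forms in verb_forms.items():
            if words & forms:
                counts[verb] += 1
    return counts


# Made-up sentences for illustration; not drawn from M-VAD or MPII-MD.
sample = [
    "Someone carries a box into the kitchen.",
    "She pours milk into the bowl.",
    "He is carrying a bucket and pouring water.",
    "They talk by the window.",
]
counts = count_verb_instances(sample)
```

A count of this kind only locates candidate sentences; whether the paired clip actually satisfies the required properties still has to be judged by watching the clip.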


We further examined ten sentences for each verb from each 
corpus, together with the corresponding video clips, and 
found that only ten out of the 240 examined satisfied prop¬ 
erties ID and [I^ Moreover, none of these ten suitable video 
clips satisfied property V. Further, of the twelve classes
(AnswerPhone, DriveCar, Eat, FightPerson, GetOutCar, Handshake,
HugPerson, Kiss, Run, SitDown, SitUp, and StandUp) in the
Hollywood 2 dataset [29], only four (AnswerPhone, DriveCar,
GetOutCar, and Eat) satisfy property[D Of these, three classes 
(AnswerPhone, DriveCar, and GetOutCar) always depict a
single object class, and thus are ill suited for codetecting
anything but the two fixed classes phone and car. The one
remaining class (Eat) fails to satisfy property V. This same
situation occurs with essentially all standard datasets used
for activity recognition, like UCF Sports [36].

The standard sources of naturally occurring video for
corpora used within the computer-vision community are
Hollywood movies and YouTube video clips. However,
Hollywood movies, in general, mostly involve dialog among
actors, or generic scenery and backgrounds. At best, only
small portions of most Hollywood movies satisfy property]^ 
and such rarely is reflected in the dialog or script, thus 
failing to satisfy property We attempted to gather a 
codetection corpus from YouTube. But again, about a dozen 
students searching YouTube for depictions of about a dozen 
common English verbs, examining hundreds of hits, found 
that fewer than 1% satisfied property and non satisfied 
property V. Thus it is only feasible to evaluate our method
on video that has been filmed to expressly satisfy
properties I–V.

While existing datasets within the computer-vision community
do not satisfy properties I–V, we believe that these
properties are nonetheless reflective of the real natural 
world. In the real world, people interact with everyday 
objects (in their kitchen, basement, driveway, and many
similar locations) all of the time. It is just that people
don't usually record such video, let alone make Hollywood 
movies about it or post it on YouTube. Further, people rarely 
describe such in naturally occurring text in movie scripts 
or in text uploaded to YouTube. Yet, children—and even 
adults—probably learn names of newly observed objects 
by observing people in their environment interacting with 
those objects in the context of dialog about such. Thus we 
believe that our problem, and our datasets, are a natural 
reflection of the kinds of learning that people employ to 
learn to recognize newly named objects. 

4.1 Datasets 

We evaluate our method on two datasets that do satisfy 
these properties. The first is a newly collected dataset, 
filmed to expressly satisfy properties I–V. This dataset was
filmed in 6 different scenes (four in the KITCHEN, one in 
the BASEMENT, and one outside the GARAGE) of a house. 
The lighting conditions vary greatly across the different 
scenes, with the BASEMENT scene the darkest, the KITCHEN 
scene exhibiting modest lighting, and the GARAGE scene 
the brightest. Within each scene, the lighting often varies 
across different video regions. We assigned 5 actors (four 
adults and one child) with 15 distinct everyday objects 
(bowl, box, bucket, cabbage, coffee grinder, cooler, cup, gas can,
juice, ketchup, milk, mouthwash, pineapple, squash, and watering
pot; see Figure 3), and had them perform different
actions which involve interaction with these objects. No
special instructions were given requiring that the actors
move slowly or that the objects not be occluded. The actors
often are partially outside the field of view. Note that the
dataset used by Srikantha and Gall [41] does not exhibit this
property. Indeed, their method employs human pose which 
requires that the human be sufficiently visible to estimate 
such. The filming was performed using a normal consumer 
camera that introduces motion blur on the objects when the 
actors move quickly. We downsampled the filmed videos 
to 768 × 432 and divided them into 150 short video clips,
each clip depicting a specific event lasting between 2 and 6 
seconds at 30 fps. The 150 video clips constitute a total of 
12,509 frames. 


The second dataset is a subset of CAD-120. Many
of the 120 clips in CAD-120 depict sequences of actions. 
We divide those clips into subclips, each containing one 
action. We discard those that fail to satisfy any of the 
properties or |Vj leaving 75 clips. These clips have 

spatial resolution 640 × 480, each clip depicting a specific
event lasting between 3 and 5 seconds at 30 fps. The 75 
video clips constitute a total of 8,854 frames, and contain 
5 distinct object classes, namely bowl, cereal, cup, jug, and 
microwave. 

4.2 Experimental Setup 

We employed Amazon Mechanical Turk (AMT, footnote 4) to obtain
three distinct sentences, by three different workers, for each
video clip in each dataset, resulting in 450 sentences for
our new dataset and 225 sentences for our subset of CAD-120.
AMT annotators were simply instructed to provide
a single sentence for each video clip that described the 
primary activity depicted taking place with objects from 
a common list of object classes that occur in the entire 
dataset. The collected sentences were all converted to the 
predicates in Table using the methods of Section |3.1| We 
processed each of the two datasets three times, each time 
using a different set of sentences produced by different 
workers; each sentence was used in exactly one run of 
the experiment. Furthermore, we divided each corpus into 
codetection sets, each set containing a small subset of the
video-sentence pairs. (For our new dataset, some pairs were 
reused in different codetection sets. For CAD-120, each pair 
was used in exactly one codetection set.) Some codetection 
sets contained only videos filmed in the same background, 
while others contained a mix of videos filmed in different 
backgrounds. (The backgrounds in each codetection set for
our new dataset are summarized in Table 3, where K, B,
and G denote KITCHEN, BASEMENT, and GARAGE, respectively.)
This rules out the possibility of codetecting objects by
simple background modeling (e.g., background subtraction).
Codetection sets were processed independently, each with
a distinct graphical model. Table 3 contains the number
of video-sentence pairs and the number of vertices in the
resulting graphical model for each codetection set of each
corpus.
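
The three-run protocol described above, in which each clip has three sentences and each sentence is used in exactly one run, can be sketched as follows. The function name `split_into_runs`, the clip identifiers, and the example sentences are hypothetical and serve only to illustrate the bookkeeping; the assignment of a worker's sentence to a particular run is an assumption.

```python
def split_into_runs(sentences_by_clip, num_runs=3):
    """Assign the i-th sentence of every clip to run i.

    sentences_by_clip maps a clip identifier to its list of sentences,
    one per annotator. Each sentence ends up in exactly one run.
    """
    runs = [{} for _ in range(num_runs)]
    for clip_id, sentences in sentences_by_clip.items():
        if len(sentences) != num_runs:
            raise ValueError(
                f"clip {clip_id!r} needs exactly {num_runs} sentences"
            )
        for run_index, sentence in enumerate(sentences):
            runs[run_index][clip_id] = sentence
    return runs


# Made-up annotations for two clips (illustration only).
annotations = {
    "clip_001": [
        "The person picked up the cup.",
        "Someone lifted a cup from the table.",
        "A man took the cup.",
    ],
    "clip_002": [
        "The person put the bowl near the box.",
        "Someone placed a bowl next to the box.",
        "A woman moved the bowl toward the box.",
    ],
}
runs = split_into_runs(annotations)
```

Each element of `runs` is then a complete set of video-sentence pairs that can be divided into codetection sets and processed independently of the other runs.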

We compared the resulting codetections against human
annotation. Human-annotated boxes around objects
are provided with CAD-120. For our new dataset, these
were obtained with AMT. We obtained five bounding-box
annotations for each target object in each video frame.
We asked annotators to annotate the referent of a specific
highlighted word in the sentence associated with the video
containing that frame. Thus the annotation reflects the
semantic constraint implied by the sentences. This resulted
in 5 × 289 = 1445 human-annotated tracks. To measure
how well codetections match human annotation, we use
the IoU, namely the ratio of the area of the intersection of
two bounding boxes to the area of their union. The object
codetection problem exhibits inherent ambiguity: different 
annotators tend to annotate different parts of an object or 
make different decisions whether to include surrounding 
background regions when the object is partially occluded. 

4. https://www.mturk.com/mturk/ 

To quantify such ambiguity, we computed intercoder agreement
between the human annotators for our datasets. We
computed $\binom{5}{2} = 10$ IoU scores for all box pairs produced
by the 5 annotators in every frame and averaged them over
the entire dataset, obtaining an overall human-human IoU
of 0.720.
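
The IoU measure and the pairwise agreement computation described above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the experiments: the box format `(x1, y1, x2, y2)` with corner coordinates and the function names `iou` and `mean_pairwise_iou` are assumptions.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two boxes.

    Boxes are assumed to be (x1, y1, x2, y2) corner coordinates
    with x1 <= x2 and y1 <= y2.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_pairwise_iou(boxes):
    """Average IoU over all unordered pairs of boxes in one frame.

    With five annotators this averages over C(5, 2) = 10 pairs.
    """
    pairs = list(combinations(boxes, 2))
    return sum(iou(a, b) for a, b in pairs) / len(pairs)
```

Averaging `mean_pairwise_iou` over every annotated frame of a dataset gives a human-human agreement figure of the kind reported above, which serves as a reference point when reading the codetection scores.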

We found no publicly available implementations of existing
video object codetection methods [22], [34], [35], [40], [41];
thus for comparison we employ four variants of our method
that alternatively disable different scores in our codetection
framework. These variants help one understand the relative
importance of different components of the framework.
Together with our full method, they are summarized below:



                   SIM   FLOW     SENT   SIM+FLOW   SIM+SENT
                                                    (our full method)
Flow score?        no    yes      yes    yes        yes
Similarity score?  yes   no       no     yes        yes
Sentence score?    no    partial  yes    partial    yes


Note that SIM uses the similarity measure but no sentential
information. This method is similar to prior video codetection
methods that employ similarity and the proposal
confidence score output by proposal generation methods to 
perform codetection. When the proposal confidence score 
is not discriminative, as is the case with our datasets, the 
prior methods degrade to SIM. FLOW exploits only binary 
movement information from the sentence indicating which 
objects are probably moving and which are probably not 
{i.e., using only the functions medFIMg and tempCoher in 
Table 1^, without similarity or any other sentence semantics 
(thus "partial" in the table). SIM+FLOW adds the similarity
score on top of FLOW. SENT uses all possible sentence 
semantics but no similarity measure. SIM+SENT is our full 
method that employs all scores. All the above variants were 
applied to each run of each codetection set of each dataset. 
Except for the changes indicated in the above table, all other 
parameters were kept constant across all such runs, thus 
resulting in an apples-to-apples comparison of the results. 
In particular, N = 500, K = 240, M = 20, and L = 15 (see 
Section]^ for details). 

4.3 Results 

We quantitatively evaluate our full method and all of
the variants by computing IoU_frame, IoU_object, IoU_set, and
IoU_dataset for each dataset as follows. Given an output box
for an object in a video frame, and the corresponding
set of annotated bounding boxes (five boxes for our new
dataset and a single box for CAD-120), we compute IoU
scores between the output box and the annotated ones, and
take the averaged IoU score as IoU_frame. Then IoU_object is
computed as the average of IoU_frame over the output object
track. Then, IoU_set is computed as the average of IoU_object
over all the object instances in a codetection set. Finally,
IoU_dataset is computed as the average of IoU_set over all
runs of all codetection sets for a dataset.
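
The nested averaging behind these four scores can be sketched as follows. The sketch assumes boxes in `(x1, y1, x2, y2)` corner format, assumes that the dataset-level score averages the set-level scores (by analogy with the accuracy definition later in this section), and uses hypothetical function names throughout.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates (assumed format)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def iou_frame(output_box, annotated_boxes):
    """Average IoU between one output box and all annotated boxes of that frame."""
    return mean(iou(output_box, box) for box in annotated_boxes)


def iou_object(track):
    """track: list of (output_box, annotated_boxes) pairs, one per frame."""
    return mean(iou_frame(out, anns) for out, anns in track)


def iou_set(tracks):
    """tracks: one track per object instance in a codetection set."""
    return mean(iou_object(track) for track in tracks)


def iou_dataset(set_scores):
    """set_scores: IoU_set values, one per run of each codetection set."""
    return mean(set_scores)


# Toy track with two frames: two annotators in the first, one in the second.
track = [
    ((0, 0, 2, 2), [(0, 0, 2, 2), (1, 1, 3, 3)]),
    ((0, 0, 2, 2), [(0, 0, 2, 2)]),
]
```

Because every level is a plain mean of the level below, a long track and a short track contribute equally to IoU_set, and a large and a small codetection set contribute equally to IoU_dataset.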

We compute IoU_set for each variant on each run of each
codetection set in each dataset as shown in Figure 4. The first

5. Both datasets, including videos, sentences, and bounding-box
annotations, are available at
http://upplysingaoflun.ecn.purdue.edu/~qobi/cccp/sentence-codetection.html


variant, SIM, using only the similarity measure, completely
fails on this task as expected. However, combining SIM with
either FLOW or SENT improves their performance. Moreover,
SENT generally outperforms FLOW, both with and
without the addition of SIM. Weak information obtained
from the sentential annotation that indicates whether the
object is moving or stationary, but no more, i.e., the distinction
between FLOW and SENT, is helpful in reducing the object
proposal search space, but without the similarity measure,
the performance is still quite poor (FLOW). Thus one can
get moderate results by combining just SIM and FLOW.
But to further boost performance, more sentence semantics
is needed, i.e., replacing FLOW with SENT. Further note
that for our new dataset, SIM+FLOW outperforms SENT,
but for CAD-120, the reverse is true. This seems to be the
case because CAD-120 has greater within-class variance so
sentential information better supports codetection than
image similarity. However, over-constrained semantics can, at
times, hinder the codetection process rather than help,
especially given the generality of our datasets. This is exhibited,
for example, with codetection set 4 on run 1 of the CAD-120
dataset, where SIM+FLOW outperforms SIM+SENT.
Thus it is important to only impose weak semantics on the
codetection process.

Also note that there is little variation in IoU_set across
different runs within a dataset. Recall that the different 
runs were performed with different sentential annotations 
produced by different workers on AMT. This indicates that 
our approach is largely insensitive to the precise sentential 
annotation. 

To evaluate the performance of our method in simply
finding objects, we define codetection accuracy Acc_frame,
Acc_object, Acc_set, and Acc_dataset for each dataset as follows.
Given an IoU threshold, we compute IoU scores between
an output box and the corresponding annotated boxes, and
binarize the scores according to the threshold. Then
Acc_frame is set to the maximum of the binarized scores,
Acc_object is computed as the average of Acc_frame over the
output object track, and Acc_set is computed as the average
of Acc_object over all the object instances in a codetection set.
Finally, we average Acc_set scores over all runs of all
codetection sets for a dataset to obtain Acc_dataset. By adjusting
the IoU threshold from 0 to 1, we get an Acc-vs-threshold
curve for each of the methods (Figure 5). It can be seen that
the codetection accuracies of our full method under different
IoU thresholds consistently outperform those of the
variants. Our method yields an average detection accuracy
(i.e., Acc_dataset) of 0.7 to 0.8 on our new dataset (when the IoU
threshold is 0.4 to 0.3) and 0.5 to 0.6 on our subset of CAD-120
(when the IoU threshold is 0.4 to 0.3). Finally, we demonstrate some
codetected object examples in Figure 6. For more examples,
we refer the readers to our project page (see footnote 6).
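
The accuracy measures and the Acc-vs-threshold curve described above can be sketched as follows, starting from precomputed IoU scores. This is a minimal illustration with hypothetical function names and toy numbers; in particular, treating a score equal to the threshold as a hit (`>=`) is an assumption, since the text does not say whether the comparison is strict.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def acc_frame(frame_ious, threshold):
    """1.0 if any annotated box reaches the threshold, else 0.0.

    frame_ious holds the IoU of the output box against each annotated box.
    Using >= rather than > is an assumption.
    """
    return max(1.0 if score >= threshold else 0.0 for score in frame_ious)


def acc_object(track_ious, threshold):
    """Average of the frame accuracy over one output object track."""
    return mean(acc_frame(frame, threshold) for frame in track_ious)


def acc_set(set_ious, threshold):
    """Average of the object accuracy over all object instances in a set."""
    return mean(acc_object(track, threshold) for track in set_ious)


def acc_dataset(dataset_ious, threshold):
    """Average of the set accuracy over all runs of all codetection sets."""
    return mean(acc_set(s, threshold) for s in dataset_ious)


def acc_curve(dataset_ious, thresholds):
    """List of (threshold, Acc_dataset) points."""
    return [(t, acc_dataset(dataset_ious, t)) for t in thresholds]


# Toy IoU scores (made up): one codetection set with two object tracks.
track_a = [[0.5, 0.2], [0.1, 0.3]]  # two frames, two annotators each
track_b = [[0.9], [0.7]]            # two frames, one annotator each
dataset = [[track_a, track_b]]
curve = acc_curve(dataset, [0.0, 0.4, 0.8, 1.0])
```

Taking the maximum of the binarized scores means that a frame counts as correct when the output box agrees with at least one annotator, which makes the accuracy more tolerant of annotator disagreement than the averaged IoU.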

5 Conclusion 

We have developed a new framework for object codetection
in video, namely, using natural language to guide codetection.
Our experiments indicate that weak sentential information
can significantly improve the results. This demonstrates
that natural language, when combined with typical
computer-vision problems, could provide the capability of
high-level reasoning that yields better solutions to these
problems.

6. http://upplysingaoflun.ecn.purdue.edu/~qobi/cccp/sentence-codetection.html

Fig. 4. IoU scores for different variants on different runs of different codetection sets on each dataset. (top) Our new dataset, codetection sets 1 through 10. (bottom) Our subset of CAD-120, codetection sets 1 through 5.

Fig. 6. Examples of the 15 codetected object classes in our new dataset (top) and the 5 codetected object classes in our subset of CAD-120 (bottom).
Note that in some examples the objects are occluded, rotated, poorly lit, or blurred due to motion, but they are still successfully codetected. (For 
demonstration purposes, the original output detections are slightly enlarged to include the surrounding context; zoom in on the screen for the best 
view). 